WIERT: Web Information Extraction via Render Tree
Abstract
Web information extraction (WIE) is a fundamental problem in web document understanding, with a significant impact on various applications. Visual information plays a crucial role in WIE tasks, as the nodes containing relevant information are often visually distinct from the other nodes, for example by being in a larger font size or having a brighter color. However, rendering the visual information of a web page can be computationally expensive. Previous works have mainly focused on the Document Object Model (DOM) tree, which lacks visual information. To efficiently exploit visual information, we propose leveraging the render tree, which combines the DOM tree and the Cascading Style Sheets Object Model (CSSOM) tree and contains not only the content and layout information but also rich visual information, at little additional acquisition cost compared to the DOM tree. In this paper, we present WIERT, a method that effectively utilizes the render tree of a web page based on a pretrained language model. We evaluate WIERT on the Klarna product page dataset, a manually labeled dataset of renderable e-commerce web pages, demonstrating its effectiveness and robustness.
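As a rough illustration of the render tree idea mentioned in the abstract, the following toy sketch merges a DOM-like tree with tag-keyed style rules. It is not WIERT's implementation or a browser's algorithm: the names `DomNode`, `RenderNode`, `INHERITED`, and `build_render_tree` are hypothetical, the selector matching is reduced to tag names, and the sample page is invented. It only shows that each rendered node carries visual properties and that nodes with `display: none` are left out.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class DomNode:
    tag: str
    text: str = ""
    children: list[DomNode] = field(default_factory=list)


@dataclass
class RenderNode:
    tag: str
    text: str
    style: dict[str, str]
    children: list[RenderNode] = field(default_factory=list)


# Assumption: only these two properties are inherited in this toy model.
INHERITED = {"font-size", "color"}


def build_render_tree(node, rules, parent_style=None):
    """Return a RenderNode for `node`, or None if the node is not rendered."""
    parent_style = parent_style or {}
    # Start from inherited properties, then apply the node's own rules.
    style = {k: v for k, v in parent_style.items() if k in INHERITED}
    style.update(rules.get(node.tag, {}))
    if style.get("display") == "none":
        return None  # hidden nodes do not appear in the render tree
    rendered = RenderNode(node.tag, node.text, style)
    for child in node.children:
        child_render = build_render_tree(child, rules, style)
        if child_render is not None:
            rendered.children.append(child_render)
    return rendered


# Invented sample page: a product title, a price, and a hidden script.
dom = DomNode("body", children=[
    DomNode("h1", "Blue Sneakers"),
    DomNode("span", "59"),
    DomNode("script", "track()"),
])
rules = {
    "body": {"font-size": "14px", "color": "black"},
    "h1": {"font-size": "32px"},
    "script": {"display": "none"},
}
tree = build_render_tree(dom, rules)
for child in tree.children:
    print(child.tag, child.text, child.style)
```

In this toy page the title node ends up with a larger font size than its siblings, which is the kind of visual cue the abstract describes as useful for locating relevant nodes.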
Similar resources
Personalizing Web Publishing via Information Extraction
because Web search and navigation are still underdeveloped. Although Web publishing is increasingly successful, it still requires too much time and effort to precisely locate specific information. This process is often tied to traditional solutions developed outside the Web scenario—for example, information retrieval (IR) models over hypertext rather than simple text documents. Moreover, even d...
Learning n-ary tree-pattern queries for web information extraction
The problem of extracting information from the Web consists in building patterns allowing to extract specific information from documents of a given Web source. Up to now, most existing techniques use string-based representations of documents as well as string-based patterns. Using tree representations naturally allows to overcome limitations of string-based approaches. While some tree-based app...
Web Information Extraction
Information extraction (IE) is the process of automatically extracting structured pieces of information from unstructured or semi-structured text documents. Classical problems in information extraction include named-entity recognition (identifying mentions of persons, places, organizations, etc.) and relationship extraction (identifying mentions of relationships between such named entities). We...
Personalized Web Services for Web Information Extraction
The field of information extraction from the Web emerged with the growth of the Web and the multiplication of online data sources. This paper is an analysis of information extraction methods. It presents a service oriented approach for web information extraction considering both web data management and extraction services. Then we propose an SOA based architecture to enhance flexibility and on-...
Web Information Extraction Systems for Web Semantization
In this paper we present a survey of web information extraction systems and semantic annotation platforms. The survey is concentrated on the problem of employment of these tools in the process of web semantization. We compare the approaches with our own solutions and propose some future directions in the development of the web semantization idea.
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2023
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v37i11.26546